Disease Prediction based on symptoms using NLP

Data Source

https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

Importing libraries

In [ ]:
import os
import zipfile
import pickle
import pandas as pd 
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

# To show all the rows of pandas dataframe
pd.set_option('display.max_rows', None)

# Show the full contents of each column without truncation
pd.set_option('display.max_colwidth', None)

Importing dataset

In [ ]:
def unzip_file(zip_path, extract_to):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
In [ ]:
zip_file_path = 'dataset-zip.zip'  
extraction_path = os.getcwd()  
unzip_file(zip_file_path, extraction_path)
In [ ]:
train=pd.read_csv('dataset/drugsComTrain_raw.csv')
test=pd.read_csv('dataset/drugsComTest_raw.csv')
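# ignore_index=True gives the combined frame a unique 0..n-1 index,
# so label lookups such as X['review'][2] always return a single row
df = pd.concat([train, test], ignore_index=True)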
In [ ]:
# df = pd.read_csv('dataset/mydrugsmerged.csv')

Exploratory Data Analysis

In [ ]:
df.shape
Out[ ]:
(215063, 7)
In [ ]:
df.head()
Out[ ]:
uniqueID drugName condition review rating date usefulCount
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9 May 20, 2012 27
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective." 8 April 27, 2010 192
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas." 5 December 14, 2009 17
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8 November 3, 2015 10
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9 November 27, 2016 37
In [ ]:
df.condition.value_counts().head(20)
Out[ ]:
condition
Birth Control                38436
Depression                   12164
Pain                         8245 
Anxiety                      7812 
Acne                         7435 
Bipolar Disorde              5604 
Insomnia                     4904 
Weight Loss                  4857 
Obesity                      4757 
ADHD                         4509 
Diabetes, Type 2             3362 
Emergency Contraception      3290 
High Blood Pressure          3104 
Vaginal Yeast Infection      3085 
Abnormal Uterine Bleeding    2744 
Bowel Preparation            2498 
Smoking Cessation            2440 
ibromyalgia                  2370 
Migraine                     2277 
Anxiety and Stress           2236 
Name: count, dtype: int64
In [ ]:
# Total number of words across all reviews (left disabled)
# total_words = df['review'].str.split().str.len().sum()
# print(f"Total number of words in the review column: {total_words}")
In [ ]:
# exploring unique elements in the dataset
print("Number of Unique Drugs present in the Dataset : ", df['drugName'].nunique())
print("Number of Unique Medical Conditions present in the Dataset : ", df['condition'].nunique())
Number of Unique Drugs present in the Dataset :  3671
Number of Unique Medical Conditions present in the Dataset :  916
In [ ]:
# checking for missing values
df.isna().sum()
Out[ ]:
uniqueID       0   
drugName       0   
condition      1194
review         0   
rating         0   
date           0   
usefulCount    0   
dtype: int64
In [ ]:
# Drop the rows with a missing condition (the only column that has missing values)
df = df.dropna(subset=['condition'])
df.isna().sum()
Out[ ]:
uniqueID       0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64
In [ ]:
df['drugName'].value_counts().head(10)
Out[ ]:
drugName
Levonorgestrel                        4930
Etonogestrel                          4421
Ethinyl estradiol / norethindrone     3753
Nexplanon                             2892
Ethinyl estradiol / norgestimate      2790
Ethinyl estradiol / levonorgestrel    2503
Phentermine                           2085
Sertraline                            1868
Escitalopram                          1747
Mirena                                1673
Name: count, dtype: int64
In [ ]:
df['condition'].value_counts().head(20).plot(kind='barh', figsize=(10, 4))
Out[ ]:
<Axes: ylabel='condition'>
In [ ]:
import warnings
warnings.filterwarnings("ignore")

plt.rcParams['figure.figsize'] = (10, 4)

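# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws the same histogram plus density curve
plt.subplot(1, 2, 1)
sns.histplot(df['rating'], kde=True, stat='density')

plt.subplot(1, 2, 2)
sns.histplot(df['usefulCount'], kde=True, stat='density')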

plt.suptitle('Distribution of Rating and Useful Count \n ', fontsize = 20)
plt.show()
In [ ]:
import seaborn as sns

# Define the custom linear gradient color palette
custom_palette = sns.color_palette("RdYlGn", 10)

plt.rcParams['figure.figsize'] = (10, 4)
sns.barplot(x=df['rating'], y=df['usefulCount'], palette=custom_palette)
plt.xlabel('\n Ratings')
plt.ylabel('Rated Useful Count\n', fontsize=20)
plt.title('\n Rating vs Usefulness \n', fontsize=20)
plt.show()
In [ ]:
# Aggregate the data to count the number of occurrences for each drug-condition pair
drug_condition_count = df.groupby(['condition', 'drugName']).size().reset_index(name='count')

# Sort the data based on the count to see the most common combinations
sorted_drug_condition_count = drug_condition_count.sort_values(by='count', ascending=False)

# Display the top 20 drug-condition pairs for a sense of the most common combinations
sorted_drug_condition_count.head(20)
Out[ ]:
condition drugName count
2185 Birth Control Etonogestrel 4394
2182 Birth Control Ethinyl estradiol / norethindrone 3081
2215 Birth Control Levonorgestrel 2884
2251 Birth Control Nexplanon 2883
2180 Birth Control Ethinyl estradiol / levonorgestrel 2107
2183 Birth Control Ethinyl estradiol / norgestimate 2097
3959 Emergency Contraception Levonorgestrel 1651
9276 Weight Loss Phentermine 1650
2196 Birth Control Implanon 1496
9159 Vaginal Yeast Infection Miconazole 1338
2244 Birth Control Mirena 1320
8566 Smoking Cessation Varenicline 1079
2287 Birth Control Skyla 1074
9172 Vaginal Yeast Infection Tioconazole 980
2221 Birth Control Lo Loestrin Fe 896
6525 Obesity Bupropion / naltrexone 888
6527 Obesity Contrave 864
8553 Smoking Cessation Chantix 857
2178 Birth Control Ethinyl estradiol / etonogestrel 827
2259 Birth Control NuvaRing 824
In [ ]:
# To make a meaningful graph, let's first aggregate the counts of reviews per condition
condition_count = df['condition'].value_counts().reset_index()
condition_count.columns = ['condition', 'count']

# Select the top 10 conditions to keep the graph interpretable
top_conditions = condition_count.head(10)

# Plotting the top conditions by their count of reviews
plt.figure(figsize=(12, 8))
sns.barplot(x='count', y='condition', data=top_conditions, palette='coolwarm')
plt.title('Top 10 Conditions by Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Condition')
plt.show()
In [ ]:
import networkx as nx

# Reduce the dataset to a manageable size by focusing on conditions with the most reviews
top_conditions_list = top_conditions['condition'].tolist()
reduced_df = df[df['condition'].isin(top_conditions_list)]

# Create a graph
G = nx.Graph()

# Add one edge per unique condition-drug pair (repeated pairs add nothing new to the graph)
unique_pairs = reduced_df[['condition', 'drugName']].drop_duplicates()
for condition, drug in unique_pairs.itertuples(index=False):
    G.add_node(condition, type='condition')
    G.add_node(drug, type='drug')
    G.add_edge(condition, drug)

# Split the nodes by type so that each group is drawn in its own color
condition_nodes = [n for n, t in G.nodes(data='type') if t == 'condition']
drug_nodes = [n for n, t in G.nodes(data='type') if t == 'drug']

# Draw the network graph with labels for nodes to identify conditions and medications
plt.figure(figsize=(16, 12))
pos = nx.spring_layout(G, k=0.5, iterations=20)

# Nodes: without nodelist, each call would draw every node and the second color would cover the first
nx.draw_networkx_nodes(G, pos, nodelist=condition_nodes, node_size=20, node_color='skyblue', alpha=0.6, label='condition')
nx.draw_networkx_nodes(G, pos, nodelist=drug_nodes, node_size=20, node_color='lightgreen', alpha=0.6, label='drug')

# Edges
nx.draw_networkx_edges(G, pos, alpha=0.4)

# Labels
nx.draw_networkx_labels(G, pos, font_size=7, font_color='darkblue')

plt.title('Enhanced Network Graph of Conditions and Medications with Labels')
plt.axis('off')
plt.show()
In [ ]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Data for bar chart (top conditions and drugs)
top_conditions_data = df['condition'].value_counts().head(10)
top_drugs_data = df['drugName'].value_counts().head(10)

# Mock-up data for model performance (assuming generic values)
accuracy = 0.85
precision = 0.8
recall = 0.75
f1_score = 0.77

# WordCloud for conditions
wordcloud_conditions = WordCloud(background_color='white', width=400, height=400).generate(' '.join(df['condition'].dropna().unique()))

# Bar chart for conditions
plt.figure(figsize=(8, 5))
top_conditions_data.plot(kind='barh', color='skyblue')
plt.title('Top 10 Conditions')
plt.tight_layout()
plt.show()

# Bar chart for drugs
plt.figure(figsize=(8, 5))
top_drugs_data.plot(kind='barh', color='lightgreen')
plt.title('Top 10 Drugs')
plt.tight_layout()
plt.show()

# Model performance summary
# plt.figure(figsize=(8, 5))
# plt.text(0.5, 0.8, f'Accuracy: {accuracy}', ha='center')
# plt.text(0.5, 0.6, f'Precision: {precision}', ha='center')
# plt.text(0.5, 0.4, f'Recall: {recall}', ha='center')
# plt.text(0.5, 0.2, f'F1 Score: {f1_score}', ha='center')
# plt.axis('off')
# plt.title('Model Performance')
# plt.tight_layout()
# plt.show()

# WordCloud for conditions
plt.figure(figsize=(8, 8))
plt.imshow(wordcloud_conditions, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Conditions')
plt.tight_layout()
plt.show()

#make labels bold
In [ ]:
wordcloud_medications = WordCloud(background_color='white', width=400, height=400).generate(' '.join(df['drugName'].dropna().unique()))
plt.figure(figsize=(8, 8))
plt.imshow(wordcloud_medications, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Medications')
plt.tight_layout()
plt.show()
In [ ]:
df_birth=df[(df['condition']=='Birth Control')]
df_dep=df[(df['condition']=='Depression')]
df_bp=df[(df['condition']=='High Blood Pressure')]
df_diab=df[(df['condition']=='Diabetes, Type 2')]
In [ ]:
from wordcloud import WordCloud
plt.figure(figsize = (20,20)) # Word cloud of Birth Control reviews
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_birth.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for Birth control',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for Birth control')
In [ ]:
plt.figure(figsize = (20,20)) # Word cloud of Depression reviews
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_dep.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for Depression',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for Depression')
In [ ]:
plt.figure(figsize = (20,20)) # Word cloud of High Blood Pressure reviews
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_bp.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for High Blood Pressure',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for High Blood Pressure')
In [ ]:
plt.figure(figsize = (20,20)) # Word cloud of Diabetes, Type 2 reviews
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_diab.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for Diabetes Type 2',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for Diabetes Type 2')

Data Preprocessing

In [ ]:
# Keep only the conditions the classifier will be trained on
selected_conditions = ['Birth Control', 'Depression', 'High Blood Pressure', 'Diabetes, Type 2', 'Insomnia', 'GERD', 'Cough', 'Acne', 'Anxiety', 'Constipation', 'Migraine']
df_train = df[df['condition'].isin(selected_conditions)]
In [ ]:
df_train.shape
Out[ ]:
(83806, 7)
In [ ]:
X = df_train.drop(['uniqueID','drugName','rating','date','usefulCount'],axis=1)
In [ ]:
X.shape
Out[ ]:
(83806, 2)
In [ ]:
X.condition.value_counts()
Out[ ]:
condition
Birth Control          38436
Depression             12164
Anxiety                7812 
Acne                   7435 
Insomnia               4904 
Diabetes, Type 2       3362 
High Blood Pressure    3104 
Migraine               2277 
Constipation           2120 
Cough                  1224 
GERD                   968  
Name: count, dtype: int64
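The class counts above are heavily skewed: Birth Control accounts for 38,436 of the 83,806 reviews, while GERD has only 968. One common way to compensate is to weight each class inversely to its frequency with the "balanced" heuristic, n_samples / (n_classes * class_count). This notebook does not do that; the following is only a sketch of the idea. The counts are copied from the output above, and `balanced_weights` is a hypothetical helper name.

```python
# Class counts copied from the value_counts() output above
counts = {
    'Birth Control': 38436,
    'Depression': 12164,
    'Anxiety': 7812,
    'Acne': 7435,
    'Insomnia': 4904,
    'Diabetes, Type 2': 3362,
    'High Blood Pressure': 3104,
    'Migraine': 2277,
    'Constipation': 2120,
    'Cough': 1224,
    'GERD': 968,
}


def balanced_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * n) for label, n in class_counts.items()}


weights = balanced_weights(counts)

# Print the classes from most down-weighted to most up-weighted
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:<20} {weight:.3f}")
```

A dictionary like this could be passed as the `class_weight` argument of scikit-learn estimators that accept one, such as `PassiveAggressiveClassifier`. Whether it helps here would need to be checked on held-out data.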
In [ ]:
X.isna().sum()
Out[ ]:
condition    0
review       0
dtype: int64
In [ ]:
#check for duplicate values
X.duplicated().sum()
Out[ ]:
35975
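The check above reports that 35,975 of the 83,806 rows (about 43%) repeat an earlier (condition, review) pair. If these rows are kept, the same text can end up in both the training and the test split, which would tend to inflate the evaluation scores. A minimal sketch of removing them with `drop_duplicates`, shown on a toy frame whose rows are made up for illustration (`toy` and `deduped` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for X; these rows are invented, not taken from the dataset
toy = pd.DataFrame({
    'condition': ['Acne', 'Acne', 'Cough', 'Acne'],
    'review': ['cleared my skin', 'cleared my skin', 'helped me sleep', 'no change'],
})

# Number of rows that repeat an earlier (condition, review) pair
print(toy.duplicated().sum())

# Keep the first occurrence of each pair and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates(subset=['condition', 'review']).reset_index(drop=True)
print(deduped.shape)
```

Applied to the notebook's data, the same call would be `X = X.drop_duplicates(subset=['condition', 'review'])`, run before any train/test split.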
In [ ]:
X.head()
Out[ ]:
condition review
2 Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas."
3 Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch"
9 Birth Control "I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger."
11 Depression "I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects."
13 Cough "Have a little bit of a lingering cough from a cold. Not giving me much trouble except keeps me up at night. I heard this was good so I took so I could get some sleep. Helped tremendously with the cough but then I was having bad stomach cramps and diarrhea. I hadn&#039;t eaten anything that should have upset my stomach and it didn&#039;t really feel like a &quot;bug&quot; so I looked up side effects for Delsym. Now I wish I had done that first because I probably wouldn&#039;t have taken it. So, while it worked for my cough I still didn&#039;t get any sleep due to the stomach issues."
In [ ]:
X['review'][2]
Out[ ]:
'"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas."'
In [ ]:
X['review'][11]
Out[ ]:
'"I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects."'
In [ ]:
# Strip the literal double quotes that wrap each review
for col in X.columns:
    X[col] = X[col].str.replace('"', '', regex=False)
In [ ]:
X['review'][11]
Out[ ]:
'I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects.'
In [ ]:
X.head()
Out[ ]:
condition review
2 Birth Control I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas.
3 Birth Control This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch
9 Birth Control I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger.
11 Depression I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects.
13 Cough Have a little bit of a lingering cough from a cold. Not giving me much trouble except keeps me up at night. I heard this was good so I took so I could get some sleep. Helped tremendously with the cough but then I was having bad stomach cramps and diarrhea. I hadn&#039;t eaten anything that should have upset my stomach and it didn&#039;t really feel like a &quot;bug&quot; so I looked up side effects for Delsym. Now I wish I had done that first because I probably wouldn&#039;t have taken it. So, while it worked for my cough I still didn&#039;t get any sleep due to the stomach issues.

Stopwords

In [ ]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suhas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\suhas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[ ]:
True
In [ ]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
In [ ]:
stop
Out[ ]:
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]
In [ ]:
#https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc kernel 
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Thanks : https://www.kaggle.com/aashita/word-clouds-of-various-shapes ##
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), 
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='white',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
    wordcloud.generate(str(text))
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  
    
plot_wordcloud(stop, title="Word Cloud of stops")
In [ ]:
not_stop = ["aren't","couldn't","didn't","doesn't","don't","hadn't","hasn't","haven't","isn't","mightn't","mustn't","needn't","no","nor","not","shan't","shouldn't","wasn't","weren't","wouldn't"]
for i in not_stop:
    stop.remove(i)
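Negation words are taken out of the stopword list because dropping them can invert the meaning of a review. A minimal sketch of the effect, using a small hand-written stopword subset rather than the full NLTK list (`base_stop`, `negations`, and `filter_words` are illustrative names, not part of the notebook):

```python
# Small illustrative subset of an English stopword list (an assumption, not the NLTK list)
base_stop = {'i', 'it', 'did', 'me', 'the', 'not', 'no', 'have', 'any'}
negations = {'not', 'no'}


def filter_words(text, stopwords):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


sentence = "It did not help me"

# With negations treated as stopwords, only 'help' survives
print(filter_words(sentence, base_stop))

# With negations removed from the stopword set, 'not help' survives
print(filter_words(sentence, base_stop - negations))
```

Building the reduced list as a set difference also sidesteps the `ValueError` that `list.remove` raises when an entry is absent, which could happen if the stopword list differs between NLTK releases.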

Lemmatization

In [ ]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

porter = PorterStemmer()

lemmatizer = WordNetLemmatizer()
In [ ]:
print(porter.stem("sportingly"))
print(porter.stem("very"))
print(porter.stem("troubled"))
sportingli
veri
troubl
In [ ]:
print(lemmatizer.lemmatize("sportingly"))
print(lemmatizer.lemmatize("very"))
print(lemmatizer.lemmatize("troubled"))
sportingly
very
troubled

Review Cleaning

In [ ]:
from bs4 import BeautifulSoup
import re
In [ ]:
# A set gives constant-time stopword lookups
stop_words = set(stop)

def review_to_words(raw_review):
    # 1. Strip HTML tags and decode entities such as &#039;
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # 2. Replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Lowercase and split into words
    words = letters_only.lower().split()
    # 4. Remove stopwords
    meaningful_words = [w for w in words if w not in stop_words]
    # 5. Lemmatize each remaining word
    lemmatized_words = [lemmatizer.lemmatize(w) for w in meaningful_words]
    # 6. Join the words back into a single space-separated string
    return ' '.join(lemmatized_words)
In [ ]:
X['review_clean'] = X['review'].apply(review_to_words)
In [ ]:
X.head()
Out[ ]:
condition review review_clean
2 Birth Control I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas. used take another oral contraceptive pill cycle happy light period max day no side effect contained hormone gestodene not available u switched lybrel ingredient similar pill ended started lybrel immediately first day period instruction said period lasted two week taking second pack two week third pack thing got even worse third period lasted two week end third week still daily brown discharge positive side side effect idea period free tempting ala
3 Birth Control This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch first time using form birth control glad went patch month first decreased libido subsided downside made period longer day exact used period day max also made cramp intense first two day period never cramp using birth control happy patch
9 Birth Control I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger. pill many year doctor changed rx chateal effective really help completely clearing acne take month though not gain extra weight develop emotional health issue stopped taking bc started using natural method birth control started take bc hate acne came back age really hope symptom like depression weight gain not begin affect older also naturally moody may worsen thing negative mental rut today also hope push edge believe depressed hopefully like younger
11 Depression I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects. taken anti depressant year improvement mostly moderate severe side affect make go take cymbalta mostly pain began deplin noticed major improvement overnight energy better disposition no sinking low low major depression taking month feel like normal person first time ever best thing no side effect
13 Cough Have a little bit of a lingering cough from a cold. Not giving me much trouble except keeps me up at night. I heard this was good so I took so I could get some sleep. Helped tremendously with the cough but then I was having bad stomach cramps and diarrhea. I hadn&#039;t eaten anything that should have upset my stomach and it didn&#039;t really feel like a &quot;bug&quot; so I looked up side effects for Delsym. Now I wish I had done that first because I probably wouldn&#039;t have taken it. So, while it worked for my cough I still didn&#039;t get any sleep due to the stomach issues. little bit lingering cough cold not giving much trouble except keep night heard good took could get sleep helped tremendously cough bad stomach cramp diarrhea eaten anything upset stomach really feel like bug looked side effect delsym wish done first probably taken worked cough still get sleep due stomach issue
In [ ]:
X['review_clean'][11]
Out[ ]:
'taken anti depressant year improvement mostly moderate severe side affect make go take cymbalta mostly pain began deplin noticed major improvement overnight energy better disposition no sinking low low major depression taking month feel like normal person first time ever best thing no side effect'

Creating features and Target Variable¶

In [ ]:
X_feat = X['review_clean']
y = X['condition']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_feat, y, stratify=y, test_size=0.2, random_state=0)
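The split above passes `stratify=y`, which keeps each condition's share the same in the training and test sets. A minimal sketch of that effect, using hypothetical toy labels (80 reviews of one condition, 20 of another) rather than the real data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels (hypothetical data, not the drug review dataset)
texts = [f"review {i}" for i in range(100)]
labels = ["Birth Control"] * 80 + ["Cough"] * 20

_, _, y_tr, y_te = train_test_split(
    texts, labels, stratify=labels, test_size=0.2, random_state=0
)

# Both splits keep the original 80/20 class ratio
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Without `stratify`, a rare condition could end up under-represented or missing from the test set purely by chance.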

Bag of Words¶

In [ ]:
count_vectorizer = CountVectorizer(stop_words='english')

count_train = count_vectorizer.fit_transform(X_train)

count_test = count_vectorizer.transform(X_test)
In [ ]:
# Illustrate the bag-of-words representation on the first 10 raw reviews
sample_reviews = df.head(10)['review'].tolist()

# Stop words are deliberately kept here so the full vocabulary is visible
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term count matrix
bag_of_words = vectorizer.fit_transform(sample_reviews)

# Use the learned vocabulary as the column names
feature_names = vectorizer.get_feature_names_out()

# One row per review, one column per token
df_bag_of_words = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

# Display the DataFrame
df_bag_of_words
Out[ ]:
039 15 21 230 26 28 2mg 2nd 2x 3rd ... work works worse worsen worth would years you younger zoloft
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 2 0 1 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 2 0 0
5 2 0 0 1 0 0 0 1 0 1 ... 2 0 0 0 1 1 0 1 0 0
6 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 1 1 1 0 0 0 1 0 1 0 ... 0 1 0 0 0 0 0 0 0 2
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 3 0 0 0 0 1 0 0 0 0 ... 0 0 0 1 0 0 1 0 1 0

10 rows × 386 columns

In [ ]:
df.head(10)['review'].tolist()
Out[ ]:
['"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective."',
 '"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas."',
 '"This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch"',
 '"Suboxone has completely turned my life around.  I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account.  I had none of those before Suboxone and spent years abusing oxycontin.  My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction.  All that is history.  If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again.  I have found the side-effects to be minimal compared to oxycontin.  I&#039;m actually sleeping better.   Slight constipation is about it for me.  It truly is amazing. The cost pales in comparison to what I spent on oxycontin."',
 '"2nd day on 5mg started to work with rock hard erections however experianced headache, lower bowel preassure. 3rd day erections would wake me up &amp; hurt! Leg/ankles aches   severe lower bowel preassure like you need to go #2 but can&#039;t! Enjoyed the initial rockhard erections but not at these side effects or $230 for months supply! I&#039;m 50 &amp; work out 3Xs a week. Not worth side effects!"',
 '"He pulled out, but he cummed a bit in me. I took the Plan B 26 hours later, and took a pregnancy test two weeks later - - I&#039;m pregnant."',
 '"Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured though I know I&#039;m not.. Bi-polar disorder is a constant battle. I know Abilify works for me because I have tried to get off it and lost complete control over my emotions. Went back on it and I was golden again.  I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past. Only side effect is I like to eat a lot."',
 '" I Ve had  nothing but problems with the Keppera : constant shaking in my arms &amp; legs &amp; pins &amp; needles feeling in my arms &amp; legs severe light headedness no appetite &amp; etc."',
 '"I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger."']

Directory Creation¶

In [ ]:
# Create the output folders if they don't already exist
os.makedirs("models", exist_ok=True)
os.makedirs("vectorizers", exist_ok=True)

Sentiment Analysis¶

In [ ]:
from textblob import TextBlob
from tqdm import tqdm

# TextBlob polarity ranges from -1 (most negative) to +1 (most positive)
reviews = df['review']

Predict_Sentiment = []
for review in tqdm(reviews):
    blob = TextBlob(review)
    Predict_Sentiment.append(blob.sentiment.polarity)
df["Predict_Sentiment"] = Predict_Sentiment
df.head()
100%|██████████| 213869/213869 [04:07<00:00, 862.94it/s]
Out[ ]:
uniqueID drugName condition review rating date usefulCount Predict_Sentiment
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9 May 20, 2012 27 0.000000
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective." 8 April 27, 2010 192 0.168333
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5 December 14, 2009 17 0.067210
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8 November 3, 2015 10 0.179545
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9 November 27, 2016 37 0.194444
In [ ]:
# Binary sentiment label derived from the rating: 1 if rating > 5, else 0
df['sentiment'] = (df["rating"] > 5).astype(int)
df.head()
Out[ ]:
uniqueID drugName condition review rating date usefulCount Predict_Sentiment sentiment
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9 May 20, 2012 27 0.000000 1
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective." 8 April 27, 2010 192 0.168333 1
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5 December 14, 2009 17 0.067210 0
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8 November 3, 2015 10 0.179545 1
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9 November 27, 2016 37 0.194444 1
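The notebook now holds two sentiment signals per review: the TextBlob polarity and the rating-derived label. A rough sketch of how they could be compared, using the five (polarity, rating) pairs displayed above and assuming a polarity threshold of 0 (a hypothetical cut-off, not one used in the notebook):

```python
# (Predict_Sentiment, rating) pairs copied from the five rows displayed above
rows = [
    (0.000000, 9),
    (0.168333, 8),
    (0.067210, 5),
    (0.179545, 8),
    (0.194444, 9),
]

POLARITY_THRESHOLD = 0.0  # assumption: polarity above 0 counts as positive

polarity_label = [1 if p > POLARITY_THRESHOLD else 0 for p, _ in rows]
rating_label = [1 if r > 5 else 0 for _, r in rows]

# Fraction of reviews where the two labels agree
agreement = sum(a == b for a, b in zip(polarity_label, rating_label)) / len(rows)
print(agreement)
```

On these five rows the two labels agree three times out of five, which hints that lexicon-based polarity and user ratings can diverge on drug reviews; the full dataset would be needed to say more.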
In [ ]:
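from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Token counts -> TF-IDF weighting -> Multinomial Naive Bayes.
# The pipeline is only defined here; it is not fitted in the cells that follow.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])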
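One plausible use of this pipeline is to predict the rating-derived `sentiment` label from the review text. The sketch below assumes that intent and fits the same three steps on four hypothetical toy reviews, since the notebook itself never calls `fit` on it:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same three steps as the pipeline defined in the notebook
sentiment_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Hypothetical toy reviews with 1 = positive, 0 = negative
toy_reviews = [
    "great relief no side effects",
    "works great very happy",
    "terrible headache and nausea",
    "awful made me feel worse",
]
toy_sentiment = [1, 1, 0, 0]

# The pipeline takes raw strings; vectorizing happens inside fit/predict
sentiment_pipeline.fit(toy_reviews, toy_sentiment)
preds = sentiment_pipeline.predict(["great and happy", "terrible nausea"])
print(list(preds))
```

Because the vectorizer lives inside the pipeline, a single pickled object would be enough to score new text later.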

Machine Learning Model : Naive Bayes¶

In [ ]:
mnb = MultinomialNB()
mnb.fit(count_train, y_train)

with open("models/multinomial_nb_model.pkl", "wb") as file:
    pickle.dump(mnb, file)
    
pred = mnb.predict(count_test)
naive_bayes_score = metrics.accuracy_score(y_test, pred)
print("Accuracy :   %0.3f" % naive_bayes_score)
Accuracy :   0.903

Machine Learning Model : Passive Aggressive Classifier¶

In [ ]:
passive = PassiveAggressiveClassifier()
passive.fit(count_train, y_train)

with open("models/passive_aggressive_model.pkl", "wb") as file:
    pickle.dump(passive, file)
    
pred = passive.predict(count_test)
pass_aggr_score = metrics.accuracy_score(y_test, pred)
print("Accuracy:   %0.3f" % pass_aggr_score)
Accuracy:   0.928
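In the cells shown here, the two bag-of-words models are pickled but the fitted `count_vectorizer` is not, so a reloaded model would have no way to turn new text into the same feature columns. A minimal sketch of saving both together, using hypothetical toy data and an in-memory pickle instead of the notebook's files:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data
train_texts = ["cough cold night sleep", "period pill cramp birth control"]
train_labels = ["Cough", "Birth Control"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

# Serialize the vectorizer and the model as one bundle
blob = pickle.dumps({"vectorizer": vec, "model": clf})

# Later (for example in another process): restore and predict on new text
bundle = pickle.loads(blob)
new_text = ["lingering cough at night"]
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(new_text))[0]
print(restored_pred)
```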

Machine Learning Model : TFIDF¶

In [ ]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

pass_tf = PassiveAggressiveClassifier()
pass_tf.fit(tfidf_train, y_train)

with open("models/tfidf_model.pkl", "wb") as file:
    pickle.dump(pass_tf, file)
    
pred = pass_tf.predict(tfidf_test)
ml_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy:   %0.3f" % ml_tfidf_score)
Accuracy:   0.939

TFIDF: Bigrams¶

In [ ]:
tfidf_vectorizer2 = TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1,2))
tfidf_train_2 = tfidf_vectorizer2.fit_transform(X_train)
tfidf_test_2 = tfidf_vectorizer2.transform(X_test)

pass_tf = PassiveAggressiveClassifier()
pass_tf.fit(tfidf_train_2, y_train)

with open("models/tfidf_bigrams_model.pkl", "wb") as file:
    pickle.dump(pass_tf, file)
   
with open("vectorizers/tfidf_vectorizer2.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer2, f)
    
pred = pass_tf.predict(tfidf_test_2)
bi_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy:   %0.3f" % bi_tfidf_score)
Accuracy:   0.961

TFIDF : Trigrams¶

In [ ]:
tfidf_vectorizer3 = TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1,3))  # unigrams, bigrams, and trigrams
tfidf_train_3 = tfidf_vectorizer3.fit_transform(X_train)
tfidf_test_3 = tfidf_vectorizer3.transform(X_test)

pass_tf = PassiveAggressiveClassifier()
pass_tf.fit(tfidf_train_3, y_train)

with open("models/tfidf_trigrams_model.pkl", "wb") as file:
    pickle.dump(pass_tf, file)
   
with open("vectorizers/tfidf_vectorizer3.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer3, f)
    
pred = pass_tf.predict(tfidf_test_3)
tri_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy:   %0.3f" % tri_tfidf_score)
Accuracy:   0.962
In [ ]:
# Reuse the trigram TF-IDF features from the previous cell with a linear SVM.
# Separate names keep the Passive Aggressive trigram score intact for the comparison chart.
from sklearn.svm import LinearSVC
svc_tf = LinearSVC()
svc_tf.fit(tfidf_train_3, y_train)
    
pred = svc_tf.predict(tfidf_test_3)
svc_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy:   %0.3f" % svc_tfidf_score)
Accuracy:   0.962
In [ ]:
from sklearn.metrics import confusion_matrix

# Confusion matrix for the most recent predictions (LinearSVC on trigram TF-IDF)
cm = confusion_matrix(y_test, pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
Out[ ]:
<Axes: >
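The heatmap shows raw counts, which makes it hard to compare classes of different sizes. Per-class recall can be read off a confusion matrix by dividing its diagonal by the row sums. A small sketch on hypothetical labels (three conditions, nine predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted conditions
y_true = ["Acne", "Acne", "Acne", "Cough", "Cough", "GERD", "GERD", "GERD", "GERD"]
y_hat = ["Acne", "Acne", "Cough", "Cough", "Cough", "GERD", "GERD", "Acne", "GERD"]
labels = ["Acne", "Cough", "GERD"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Recall per class = correct predictions / number of true samples in that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)

for name, recall in zip(labels, per_class_recall):
    print(f"{name}: {recall:.2f}")
```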
In [ ]:
import matplotlib.pyplot as plt

# Create a bar graph to compare the accuracy of each model
models = ['multinomial_nb_model', 'passive_aggressive_model', 'ml_tfidf_model', 'bi_tfidf_model', 'tri_tfidf_model']
accuracy = [naive_bayes_score, pass_aggr_score, ml_tfidf_score, bi_tfidf_score, tri_tfidf_score]

plt.figure(figsize=(8, 5))
plt.bar(models, accuracy, color='skyblue')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Accuracy of Different Models')
plt.ylim(0, 1)  # Set the y-axis limits
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)  # Add horizontal grid lines
for i, v in enumerate(accuracy):
    plt.text(i, v + 0.01, str(round(v, 2)), ha='center', va='bottom', fontsize=9)  # Add text labels on bars
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

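# Define pipelines
# Note: unlike the earlier count-based Naive Bayes and Passive Aggressive models,
# every pipeline here uses TF-IDF features (and without max_df=0.8)
pipeline_mnb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB())
])

pipeline_passive = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', PassiveAggressiveClassifier())
])

# Configured identically to pipeline_passive (unigram TF-IDF + Passive Aggressive)
pipeline_ml_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', PassiveAggressiveClassifier())
])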

pipeline_bi_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('clf', PassiveAggressiveClassifier())
])

pipeline_tri_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
    ('clf', PassiveAggressiveClassifier())
])

# Define the metrics to be used for evaluation
scoring = {'accuracy': 'accuracy',
           'precision_macro': 'precision_macro',
           'recall_macro': 'recall_macro',
           'f1_macro': 'f1_macro'}

# Perform cross-validation for each model
cv_results_mnb = cross_validate(pipeline_mnb, X_train, y_train, cv=5, scoring=scoring)
cv_results_passive = cross_validate(pipeline_passive, X_train, y_train, cv=5, scoring=scoring)
cv_results_ml_tfidf = cross_validate(pipeline_ml_tfidf, X_train, y_train, cv=5, scoring=scoring)
cv_results_bi_tfidf = cross_validate(pipeline_bi_tfidf, X_train, y_train, cv=5, scoring=scoring)
cv_results_tri_tfidf = cross_validate(pipeline_tri_tfidf, X_train, y_train, cv=5, scoring=scoring)

# Calculate mean scores for each metric
mean_scores_mnb = {metric: np.mean(cv_results_mnb[f'test_{metric}']) for metric in scoring}
mean_scores_passive = {metric: np.mean(cv_results_passive[f'test_{metric}']) for metric in scoring}
mean_scores_ml_tfidf = {metric: np.mean(cv_results_ml_tfidf[f'test_{metric}']) for metric in scoring}
mean_scores_bi_tfidf = {metric: np.mean(cv_results_bi_tfidf[f'test_{metric}']) for metric in scoring}
mean_scores_tri_tfidf = {metric: np.mean(cv_results_tri_tfidf[f'test_{metric}']) for metric in scoring}

# Print mean scores for each model
print("Mean scores for Multinomial Naive Bayes:")
print(mean_scores_mnb)
print()
print("Mean scores for Passive Aggressive:")
print(mean_scores_passive)
print()
print("Mean scores for Machine Learning TF-IDF:")
print(mean_scores_ml_tfidf)
print()
print("Mean scores for Bigram TF-IDF:")
print(mean_scores_bi_tfidf)
print()
print("Mean scores for Trigram TF-IDF:")
print(mean_scores_tri_tfidf)
Mean scores for Multinomial Naive Bayes:
{'accuracy': 0.7754161259553705, 'precision_macro': 0.9163488566352612, 'recall_macro': 0.49411498534902076, 'f1_macro': 0.5839305004348863}

Mean scores for Passive Aggressive:
{'accuracy': 0.9273760134387707, 'precision_macro': 0.9066018673527131, 'recall_macro': 0.8948885803209616, 'f1_macro': 0.9004285938528789}

Mean scores for Machine Learning TF-IDF:
{'accuracy': 0.9277041668305636, 'precision_macro': 0.9056174534166935, 'recall_macro': 0.8953284954398164, 'f1_macro': 0.9001327700610517}

Mean scores for Bigram TF-IDF:
{'accuracy': 0.9476314053041353, 'precision_macro': 0.9347807122733449, 'recall_macro': 0.9205362632516806, 'f1_macro': 0.9273060137576874}

Mean scores for Trigram TF-IDF:
{'accuracy': 0.949316879394401, 'precision_macro': 0.9363340105153597, 'recall_macro': 0.9223345140017841, 'f1_macro': 0.9289027141028855}
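The Multinomial Naive Bayes result above pairs an accuracy of about 0.78 with a macro recall of about 0.49. A gap like that typically appears when a model does well on frequent classes and poorly on rare ones, because macro averaging weights every class equally. A toy illustration with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced test set: 9 samples of class "A", 1 of class "B"
y_true = ["A"] * 9 + ["B"]

# A degenerate model that always predicts the majority class
y_hat = ["A"] * 10

toy_accuracy = accuracy_score(y_true, y_hat)
toy_macro_recall = recall_score(y_true, y_hat, average="macro")

# Accuracy looks strong, but recall is 1.0 for "A" and 0.0 for "B"
print(toy_accuracy, toy_macro_recall)
```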
In [ ]:
'''
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load dataset
data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Define models
models = {
    "LogisticRegression": LogisticRegression(),
    "RidgeClassifier": RidgeClassifier(),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC()
}

# Vectorize data
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Results list to store model metrics
results = []

# Iterate over models
for model_name, model in models.items():
    # Train model
    model.fit(X_train_vec, y_train)

    # Test model
    y_pred = model.predict(X_test_vec)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Append results
    results.append([model_name, precision, recall, f1, accuracy])

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=["Model", "Precision", "Recall", "F1", "Accuracy"])
    
# Print table
print("TABLE BAG-OF-WORDS")
print(results_df)
'''
TABLE BAG-OF-WORDS
                Model  Precision    Recall        F1  Accuracy
0  LogisticRegression  0.877707   0.877454  0.876988  0.877454
1  RidgeClassifier     0.870365   0.868170  0.868831  0.868170
2  MultinomialNB       0.869713   0.851194  0.840494  0.851194
3  LinearSVC           0.886864   0.886472  0.886273  0.886472
In [ ]:
# mnb: Multinomial Naive Bayes model
# passive: Passive Aggressive Classifier model

# Train the models
mnb.fit(count_train, y_train)
passive.fit(count_train, y_train)

# Get predictions for both training and test sets
train_pred_mnb = mnb.predict(count_train)
test_pred_mnb = mnb.predict(count_test)

train_pred_passive = passive.predict(count_train)
test_pred_passive = passive.predict(count_test)

# Calculate the evaluation metrics for both training and test sets
train_accuracy_mnb = metrics.accuracy_score(y_train, train_pred_mnb)
test_accuracy_mnb = metrics.accuracy_score(y_test, test_pred_mnb)

train_accuracy_passive = metrics.accuracy_score(y_train, train_pred_passive)
test_accuracy_passive = metrics.accuracy_score(y_test, test_pred_passive)

# Plot the results
plt.figure(figsize=(10, 5))

# Multinomial Naive Bayes
plt.subplot(1, 2, 1)
plt.bar(['Train Accuracy', 'Test Accuracy'], [train_accuracy_mnb, test_accuracy_mnb], color=['blue', 'orange'])
plt.ylim(0, 1)
plt.title('Multinomial Naive Bayes')

# Passive Aggressive Classifier
plt.subplot(1, 2, 2)
plt.bar(['Train Accuracy', 'Test Accuracy'], [train_accuracy_passive, test_accuracy_passive], color=['blue', 'orange'])
plt.ylim(0, 1)
plt.title('Passive Aggressive Classifier')

plt.suptitle('Training vs Test Accuracy')
plt.show()

Drug Recommendation¶

In [ ]:
df_drug = df[(df['rating']>=9)&(df['usefulCount']>=100)].sort_values(by = ['rating', 'usefulCount'], ascending = [False, False])
In [ ]:
def recommend_drug(disease, top_n=3):
    # df_drug is already sorted by rating and usefulCount, so take the first distinct drug names
    matches = df_drug[df_drug['condition'] == disease]['drugName']
    return matches.drop_duplicates().head(top_n).tolist()
In [ ]:
recommend_drug("GERD")
Out[ ]:
['Zantac 150', 'Ranitidine', 'Zantac']
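The recommendation logic is a filter, a sort, and a top-3 lookup per condition. The sketch below replays it on a hypothetical five-row table (drug names and counts are made up) and compares taking the first three rows with taking the first three distinct names, since one drug can have several highly rated reviews:

```python
import pandas as pd

# Hypothetical reviews; DrugA appears twice for the same condition
toy = pd.DataFrame({
    "drugName": ["DrugA", "DrugA", "DrugB", "DrugC", "DrugD"],
    "condition": ["GERD", "GERD", "GERD", "GERD", "Acne"],
    "rating": [10, 10, 9, 10, 10],
    "usefulCount": [300, 250, 400, 120, 500],
})

# Keep well-rated, widely upvoted reviews, best first
top = toy[(toy["rating"] >= 9) & (toy["usefulCount"] >= 100)].sort_values(
    by=["rating", "usefulCount"], ascending=[False, False]
)

def recommend(disease, n=3, distinct=True):
    """Return up to n drug names for a condition, optionally de-duplicated."""
    names = top.loc[top["condition"] == disease, "drugName"]
    if distinct:
        names = names.drop_duplicates()
    return names.head(n).tolist()

with_duplicates = recommend("GERD", distinct=False)
distinct_names = recommend("GERD")
unknown = recommend("Not A Condition")
print(with_duplicates, distinct_names, unknown)
```

A condition that is absent from the filtered table yields an empty list, so callers may want to handle that case explicitly.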

Predictions¶

In [ ]:
# df_test = df.groupby(['condition','drugName']).agg({'total_pred' : ['mean']})
# df_test
In [ ]:
# Load the trigram TF-IDF model, which had the highest test accuracy (0.962), together with its fitted vectorizer

with open('models/tfidf_trigrams_model.pkl', 'rb') as f:
    tfidf_trigrams_model = pickle.load(f)

with open('vectorizers/tfidf_vectorizer3.pkl', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)
In [ ]:
text = ["Increased thirst. Frequent urination. Increased hunger. Unintended weight loss. Fatigue. Blurred vision. Slow-healing sores. Frequent infections."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Diabetes, Type 2
Out[ ]:
['Victoza', 'Liraglutide', 'Canagliflozin']
In [ ]:
text = ["Difficulty falling asleep at night. Waking up during the night. Waking up too early. Not feeling well-rested after a night's sleep. Daytime tiredness or sleepiness. Irritability, depression or anxiety. Difficulty paying attention, focusing on tasks or remembering. Increased errors or accidents."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Insomnia
Out[ ]:
['Trazodone', 'Clonazepam', 'Remeron']
In [ ]:
text = ["Crusting of skin bumps. Cysts. Papules (small red bumps) Pustules (small red bumps containing white or yellow pus) Redness around the skin eruptions. Scarring of the skin. Whiteheads. Blackheads."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Acne
Out[ ]:
['Tretinoin', 'Spironolactone', 'Differin']
In [ ]:
text = ["Spotting between periods. Breakthrough bleeding, or spotting, refers to when vaginal bleeding occurs between menstrual cycles. Nausea. Breast tenderness. Headaches and migraine. Weight gain. Mood changes. Missed periods. Decreased libido."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Birth Control
Out[ ]:
['Mirena', 'Levonorgestrel', 'Implanon']
In [ ]:
text = ["A burning sensation in your chest (heartburn), usually after eating, which might be worse at night or while lying down. Backwash (regurgitation) of food or sour liquid. Upper abdominal or chest pain. Trouble swallowing (dysphagia) Sensation of a lump in your throat."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : GERD
Out[ ]:
['Zantac 150', 'Ranitidine', 'Zantac']
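The five prediction cells above repeat the same transform and predict steps. They could be wrapped in one helper, sketched here with a hypothetical function name and a tiny stand-in model (TF-IDF plus Multinomial Naive Bayes on two made-up texts) so that the example runs without the pickled files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def predict_condition(symptoms, vectorizer, model):
    """Return the predicted condition for one free-text symptom description."""
    features = vectorizer.transform([symptoms])
    return model.predict(features)[0]

# Stand-in vectorizer and model trained on hypothetical toy data
toy_texts = [
    "heartburn chest burning after eating",
    "cannot sleep waking up at night tired",
]
toy_labels = ["GERD", "Insomnia"]
toy_vec = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)

result = predict_condition("burning sensation in chest after eating", toy_vec, toy_model)
print("Predicted disease :", result)
```

In the notebook, the same helper would be called with `tfidf_vectorizer` and `tfidf_trigrams_model`, and its result passed to `recommend_drug`.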